Logistic Regression and Support Vector Machines

This blog post provides a detailed explanation of two key concepts in machine learning: Logistic Regression and Support Vector Machines (SVMs). Both are fundamental algorithms for classification tasks.

1. Logistic Regression

Logistic Regression is a probabilistic classification model used primarily for binary classification problems. Unlike linear regression, which predicts continuous values, logistic regression outputs the probability that a given input belongs to a particular class (typically the positive class, labeled as y=1). This probability is constrained to the interval [0, 1].

1.1 From Probability to Odds to Log-Odds

Directly modeling probability p (where 0 ≤ p ≤ 1) is challenging because linear models can produce values outside this range. To address this, we transform the probability:

  • Odds: Defined as p / (1 – p), odds range from 0 to ∞.
  • Log-Odds (Logit): The natural logarithm of odds, log(p / (1 – p)), which ranges from -∞ to +∞. This unrestricted range allows us to model it as a linear function of the features.

Figure 1: Illustration of the transformation from probability to odds and log-odds.
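The probability-to-odds-to-log-odds chain can be sketched directly in plain Python (the function names `odds` and `logit` here are illustrative, not from any particular library):

```python
import math

def odds(p):
    """Convert a probability p in (0, 1) to odds in (0, inf)."""
    return p / (1 - p)

def logit(p):
    """Convert a probability to log-odds (the logit), in (-inf, inf)."""
    return math.log(odds(p))

# p = 0.5 gives even odds (1.0) and a logit of 0.
print(odds(0.5), logit(0.5))
# p = 0.8 gives roughly 4-to-1 odds.
print(odds(0.8))
```

Note how a probability above 0.5 maps to a positive logit and one below 0.5 to a negative logit, which is exactly the unrestricted range a linear model needs.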

1.2 The Logistic (Sigmoid) Function

By setting the log-odds equal to a linear combination of features, we derive the logistic function:

$$ p(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + w_0)}} $$

Here, \(\sigma(z)\) is the sigmoid function, which maps any real number z into the interval (0, 1), producing an S-shaped curve.

Figure 2: The sigmoid function mapping real values to probabilities in [0,1].
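A minimal sigmoid implementation in plain Python might look like the sketch below; splitting on the sign of z avoids overflow in `exp` for large negative inputs:

```python
import math

def sigmoid(z):
    """Map any real z to (0, 1); numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0))     # 0.5, the midpoint of the S-curve
print(sigmoid(10))    # very close to 1
print(sigmoid(-10))   # very close to 0
```

The symmetry sigmoid(-z) = 1 - sigmoid(z) is what makes the same curve serve for both classes.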

Figure 3: Additional visualization of the sigmoid curve.

1.3 Handling Categorical Variables

Real-world datasets often include categorical features:

  • Ordinal variables (ordered categories, e.g., grades A-F): Map to ordered numbers (A=4, B=3, etc.).
  • Nominal variables (unordered, e.g., eye color): Use one-hot encoding to create binary vectors, avoiding implicit ordering.

Example: Eye colors {blue, green, brown} → blue: [1, 0, 0], green: [0, 1, 0], brown: [0, 0, 1]. (Often drop one category to avoid redundancy.)

Figure 4: Example of one-hot encoding for categorical variables.
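One-hot encoding as described above fits in a few lines; a real pipeline would typically use a library encoder, so this hand-rolled `one_hot` is purely illustrative:

```python
def one_hot(value, categories):
    """Encode a nominal value as a binary vector over the given categories."""
    return [1 if value == c else 0 for c in categories]

colors = ["blue", "green", "brown"]
print(one_hot("blue", colors))   # [1, 0, 0]
print(one_hot("green", colors))  # [0, 1, 0]
print(one_hot("brown", colors))  # [0, 0, 1]
```

Exactly one position is 1 in each vector, so no artificial ordering is imposed on the categories.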

1.4 Training Logistic Regression: Maximum Likelihood Estimation

Parameters \(\mathbf{w}\) and \(w_0\) are learned by maximizing the likelihood of observing the training data. The log-likelihood function is:

$$ L(\mathbf{w}) = \sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] $$

where \( p_i = \sigma(\mathbf{w}^T \mathbf{x}_i + w_0) \).

This is equivalent to minimizing the negative log-likelihood (the cross-entropy loss). The log-likelihood is concave (equivalently, the cross-entropy loss is convex), so any local optimum is global; however, no closed-form solution exists, so the parameters are found with iterative methods such as gradient descent on the loss (or, equivalently, gradient ascent on the log-likelihood).
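A bare-bones gradient-descent version of this training loop can be sketched in plain Python on a toy 1-D dataset (the learning rate, epoch count, and data below are illustrative choices, not tuned values):

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logreg(X, y, lr=0.1, epochs=2000):
    """Fit w, w0 by gradient descent on the average negative log-likelihood."""
    n, d = len(X), len(X[0])
    w, w0 = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_w0 = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w0)
            err = p - yi  # derivative of cross-entropy w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_w0 += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        w0 -= lr * grad_w0 / n
    return w, w0

# Toy 1-D data: label 1 when x > 2, label 0 otherwise.
X = [[0.0], [1.0], [1.5], [2.5], [3.0], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, w0 = train_logreg(X, y)
print(sigmoid(w[0] * 0.0 + w0), sigmoid(w[0] * 4.0 + w0))
```

The convenient identity used here is that the gradient of the cross-entropy with respect to the logit is simply (p - y), which keeps the update rule short.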

2. Support Vector Machines (SVMs)

Support Vector Machines are discriminative classifiers that directly learn a decision boundary, outputting class labels (+1 or -1) without probabilities.

2.1 Linear SVM Decision Boundary

The decision rule is:

$$ \hat{y} = \operatorname{sign}(\mathbf{w}^T \mathbf{x} + b) $$

This defines a hyperplane \(\mathbf{w}^T \mathbf{x} + b = 0\).
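The decision rule translates directly into code; here is a minimal sketch (breaking the tie at a score of exactly zero toward +1 is an arbitrary convention, not part of the SVM definition):

```python
def svm_predict(w, b, x):
    """Linear SVM decision rule: the sign of w.x + b, with ties sent to +1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# With w = [1, -1] and b = 0.5, the point (2, 1) scores 1.5 and gets +1.
print(svm_predict([1.0, -1.0], 0.5, [2.0, 1.0]))
print(svm_predict([1.0, -1.0], 0.5, [0.0, 2.0]))
```

Unlike the logistic model, the output is a hard label; no probability is attached to the score.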

2.2 Choosing the Optimal Boundary: Maximum Margin

Multiple hyperplanes may separate the data perfectly. SVM selects the one maximizing the margin—the distance to the nearest points (support vectors)—for better generalization and robustness to noise.

Figure 5: Comparison of possible separating hyperplanes; the maximum margin one is most robust.

Figure 6: Illustration highlighting the maximum margin principle.

Figure 7: SVM hyperplane with maximum margin and highlighted support vectors.

Figure 8: Detailed view of the maximum margin hyperplane and support vectors.

2.3 Margin Calculation and Optimization

The margin boundaries are two parallel hyperplanes: \(\mathbf{w}^T \mathbf{x} + b = 1\) (positive side) and \(\mathbf{w}^T \mathbf{x} + b = -1\) (negative side).

The distance from a point \(\mathbf{x}\) to the hyperplane is \( \frac{|\mathbf{w}^T \mathbf{x} + b|}{||\mathbf{w}||} \). For the support vectors, this distance is \( \frac{1}{||\mathbf{w}||} \), so the total margin (the distance between the two margin hyperplanes) is \( \frac{2}{||\mathbf{w}||} \).

SVM maximizes the margin by minimizing \( ||\mathbf{w}|| \) (or \( \frac{1}{2} ||\mathbf{w}||^2 \)) subject to:

$$ y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i $$

Figure 9: Diagram showing margin calculation and distance to hyperplane.

This constrained optimization yields a robust classifier focused on the most critical data points (support vectors).
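The margin arithmetic above can be checked numerically with a short sketch (the weight vector and point below are made-up values, chosen so the point lies exactly on a margin hyperplane):

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance |w.x + b| / ||w|| from point x to the hyperplane w.x + b = 0."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    norm = math.sqrt(sum(wj * wj for wj in w))
    return abs(score) / norm

# For a support vector, w.x + b = +/-1, so its distance is 1/||w||.
w, b = [3.0, 4.0], -5.0   # ||w|| = 5
x_sv = [1.2, 0.6]         # 3*1.2 + 4*0.6 - 5 = 1, so x_sv sits on the +1 margin
print(distance_to_hyperplane(w, b, x_sv))   # approximately 0.2 = 1/||w||
```

Shrinking \(||\mathbf{w}||\) therefore widens the margin, which is exactly why the optimization minimizes \( \frac{1}{2} ||\mathbf{w}||^2 \).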

Conclusion

Logistic Regression provides probabilistic outputs ideal for interpreting confidence, while SVMs excel in finding robust boundaries for high-dimensional data. Both are cornerstone algorithms in supervised learning.